Data Mining on Cloud Computing Platforms - Challenges and Solutions
Dr. Yi Pan

Distinguished University Professor
Department of Computer Science
Georgia State University
Atlanta, Georgia, USA

Abstract: Cloud computing has emerged rapidly as a paradigm for on-demand access to computing, data, and software utilities under a usage-based billing model. Users essentially rent resources and pay only for what they use, and everything, including software, platform, and infrastructure, is delivered as a service. Many massive-data applications, including data mining, should be ideal applications for cloud platforms. However, with current cloud programming models, complicated data mining algorithms cannot be implemented easily or executed efficiently on many cloud platforms. In this talk, I will review different massively parallel computing platforms and compare the computing domains and programming models on these platforms, their limitations, and potential solutions, especially for data mining applications. In particular, I will point out the shortcomings of current cloud computing programming models for typical data mining algorithms and propose possible solutions. The current MapReduce model and its variants have succeeded in data-parallel applications such as database operations and web search; however, they are still not effective for applications with many data dependencies, such as data mining and graph applications. We propose several approaches to solving this problem: extending current programming models, automatically translating sequential code into cloud code, building a simple API and framework on top of current cloud models, and detecting data and task parallelism and scheduling it efficiently. Some preliminary theoretical and experimental results will also be reported in this talk.